PoLAPACK: parallel factorization routines with algorithmic blocking
Author
Abstract
LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and they have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms so that they perform matrix-matrix operations, that is, so that they obtain the highest performance by maximizing reuse of data in the upper levels of the memory hierarchy, such as cache. Since parallel computers have different ratios of computation to communication, the optimal computational block size differs from machine to machine if an algorithm is to deliver its maximum performance. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. A block size that is too small or too large makes it nearly impossible to obtain good performance on a machine; in such a case, obtaining better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with "algorithmic blocking" on a 2-dimensional block-cyclic data distribution. With algorithmic blocking, it is possible to obtain near-optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the ScaLAPACK factorization routines.
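To make the role of the computational block size concrete, the following is a minimal, single-process Python/NumPy sketch (not the PoLAPACK code, and without pivoting) of a right-looking blocked LU factorization. The block size nb here is purely a tuning parameter of the loop, independent of how the matrix is stored; the algorithmic blocking described in the abstract exploits exactly this separation on the distributed matrix, so the computational block size need not match the physical distribution block size. The function name blocked_lu and all parameter choices are illustrative assumptions.

import numpy as np

def blocked_lu(A, nb):
    """Overwrite A with its LU factors (unit-lower L below the diagonal,
    U on and above the diagonal), processing nb columns at a time."""
    n = A.shape[0]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization: unblocked (Level-2) elimination of nb columns.
        for j in range(k, k + kb):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        if k + kb < n:
            # Triangular solve for the block row of U:  L11 * U12 = A12.
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            # Rank-kb update of the trailing submatrix (Level-3 work).
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant, so no pivoting needed
    LU = blocked_lu(A.copy(), nb=32)
    L = np.tril(LU, -1) + np.eye(n)
    U = np.triu(LU)
    print(np.allclose(L @ U, A))

The panel factorization is Level-2 work, while the triangular solve and the trailing-matrix update are Level-3 operations whose granularity is set by nb; this is the knob that algorithmic blocking tunes independently of the data layout.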
Similar resources
Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS
Given an implementation of Distributed BLAS Level 3 kernels, the parallelization of dense linear algebra libraries such as LAPACK can be achieved easily. In this paper, we briefly describe the implementation and performance on the AP1000 of Distributed BLAS Level 3 for the rectangular r × s block-cyclic matrix distribution. Then, the parallelization of the central matrix factorization and the trid...
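As a rough illustration of the r × s block-cyclic distribution mentioned above (a sketch under assumed conventions, not the AP1000 implementation), the mapping from a global matrix entry to its owning process and local index on a P × Q process grid can be written as follows; the function name owner_and_local and the example values are hypothetical.

def owner_and_local(i, j, r, s, P, Q):
    """Return ((prow, pcol), (li, lj)): the grid coordinates of the process
    owning global entry (i, j), and its local indices on that process,
    for r x s blocks dealt out cyclically over a P x Q grid."""
    bi, bj = i // r, j // s              # global block coordinates
    prow, pcol = bi % P, bj % Q          # cyclic assignment of blocks to the grid
    li = (bi // P) * r + i % r           # local row index on the owning process
    lj = (bj // Q) * s + j % s           # local column index on the owning process
    return (prow, pcol), (li, lj)

# Example: entry (7, 10) with 2 x 3 blocks on a 2 x 2 process grid.
print(owner_and_local(i=7, j=10, r=2, s=3, P=2, Q=2))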
A Comparison of Lookahead and Algorithmic Blocking Techniques for Parallel Matrix Factorization
Partitioning and Blocking Issues for a Parallel Incomplete Factorization
The purpose of this work is to provide a method which exploits the parallel blockwise algorithmic approach used in the framework of high performance sparse direct solvers in order to develop robust and efficient preconditioners based on a parallel incomplete factorization.
Parallel Genetic Algorithm Using Algorithmic Skeleton
Algorithmic skeletons have received attention as an efficient method of parallel programming in recent years. Using this method, the programmer can implement parallel programs easily. In this study, a set of efficient algorithmic skeletons is introduced for use in implementing a parallel genetic algorithm (PGA). A performance model is derived for each skeleton that makes the comparison of skeletons po...
Matrix factorization routines on heterogeneous architectures
In this work we consider a method for parallelizing matrix factorization algorithms on systems with Intel® Xeon Phi™ coprocessors. We provide performance results of matrix factorization routines that implement this approach and are available in the Intel® Math Kernel Library (Intel MKL), on the Intel® Xeon® processor line with Intel Xeon Phi coprocessors. Summary: New heterogeneous systems consisting...
Journal: Concurrency and Computation: Practice and Experience
Volume 13, Issue –
Pages –
Year of publication: 2001